{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(hpu_resnet_training)=\n",
    "# ResNet Model Training with Intel Gaudi\n",
    "\n",
    "<a id=\"try-anyscale-quickstart-intel_gaudi-resnet\" href=\"https://console.anyscale.com/register/ha?render_flow=ray&utm_source=ray_docs&utm_medium=docs&utm_campaign=intel_gaudi-resnet\">\n",
    "    <img src=\"../../../_static/img/run-on-anyscale.svg\" alt=\"try-anyscale-quickstart\">\n",
    "</a>\n",
    "<br></br>\n",
    "\n",
    "In this Jupyter notebook, we will train a ResNet-50 model to classify images of ants and bees using HPU. We will use PyTorch for model training and Ray for distributed training. The dataset will be downloaded and processed using torchvision's datasets and transforms.\n",
    "\n",
    "[Intel Gaudi AI Processors (HPUs)](https://habana.ai) are AI hardware accelerators designed by Intel Habana Labs. For more information, see [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/index.html) and [Gaudi Developer Docs](https://developer.habana.ai/).\n",
    "\n",
    "## Configuration\n",
    "\n",
    "A node with Gaudi/Gaudi2 installed is required to run this example. Both Gaudi and Gaudi2 have 8 HPUs. We will use 2 workers to train the model, each using 1 HPU.\n",
    "\n",
    "We recommend using a prebuilt container to run these examples. To run a container, you need Docker. See [Install Docker Engine](https://docs.docker.com/engine/install/) for installation instructions.\n",
    "\n",
    "Next, follow [Run Using Containers](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html?highlight=installer#run-using-containers) to install the Gaudi drivers and container runtime.\n",
    "\n",
    "Next, start the Gaudi container:\n",
    "```bash\n",
    "docker pull vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest\n",
    "docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest\n",
    "```\n",
    "\n",
    "Inside the container, install Ray and Jupyter to run this notebook.\n",
    "```bash\n",
    "pip install ray[train] notebook\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from typing import Dict\n",
    "from tempfile import TemporaryDirectory\n",
    "\n",
    "import torch\n",
    "from filelock import FileLock\n",
    "from torch import nn\n",
    "import torch.optim as optim\n",
    "from torch.utils.data import DataLoader\n",
    "from torchvision import datasets, transforms, models\n",
    "from tqdm import tqdm\n",
    "\n",
    "import ray\n",
    "import ray.train as train\n",
    "from ray.train import ScalingConfig, Checkpoint\n",
    "from ray.train.torch import TorchTrainer\n",
    "from ray.train.torch import TorchConfig\n",
    "from ray.runtime_env import RuntimeEnv\n",
    "\n",
    "import habana_frameworks.torch.core as htcore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Data Transforms\n",
    "\n",
    "We will set up the data transforms for preprocessing images for training and validation. This includes random cropping, flipping, and normalization for the training set, and resizing and normalization for the validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data augmentation and normalization for training\n",
    "# Just normalization for validation\n",
    "data_transforms = {\n",
    "    \"train\": transforms.Compose([\n",
    "        transforms.RandomResizedCrop(224),\n",
    "        transforms.RandomHorizontalFlip(),\n",
    "        transforms.ToTensor(),\n",
    "        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),\n",
    "    ]),\n",
    "    \"val\": transforms.Compose([\n",
    "        transforms.Resize(256),\n",
    "        transforms.CenterCrop(224),\n",
    "        transforms.ToTensor(),\n",
    "        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),\n",
    "    ]),\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset Download Function\n",
    "\n",
    "We will define a function to download the Hymenoptera dataset. This dataset contains images of ants and bees for a binary classification problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def download_datasets():\n",
    "    os.system(\"wget https://download.pytorch.org/tutorial/hymenoptera_data.zip >/dev/null 2>&1\")\n",
    "    os.system(\"unzip hymenoptera_data.zip >/dev/null 2>&1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset Preparation Function\n",
    "\n",
    "After downloading the dataset, we need to build PyTorch datasets for training and validation. The `build_datasets` function will apply the previously defined transforms and create the datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_datasets():\n",
    "    torch_datasets = {}\n",
    "    for split in [\"train\", \"val\"]:\n",
    "        torch_datasets[split] = datasets.ImageFolder(\n",
    "            os.path.join(\"./hymenoptera_data\", split), data_transforms[split]\n",
    "        )\n",
    "    return torch_datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Initialization Functions\n",
    "\n",
    "We will define two functions to initialize our model. The `initialize_model` function will load a pre-trained ResNet-50 model and replace the final classification layer for our binary classification task. The `initialize_model_from_checkpoint` function will load a model from a saved checkpoint if available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def initialize_model():\n",
    "    # Load pretrained model params\n",
    "    model = models.resnet50(pretrained=True)\n",
    "\n",
    "    # Replace the original classifier with a new Linear layer\n",
    "    num_features = model.fc.in_features\n",
    "    model.fc = nn.Linear(num_features, 2)\n",
    "\n",
    "    # Ensure all params get updated during finetuning\n",
    "    for param in model.parameters():\n",
    "        param.requires_grad = True\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation Function\n",
    "\n",
    "To assess the performance of our model during training, we define an `evaluate` function. This function computes the number of correct predictions by comparing the predicted labels with the true labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate(logits, labels):\n",
    "    _, preds = torch.max(logits, 1)\n",
    "    corrects = torch.sum(preds == labels).item()\n",
    "    return corrects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training Loop Function\n",
    "\n",
    "This function defines the training loop that will be executed by each worker. It includes downloading the dataset, preparing data loaders, initializing the model, and running the training and validation phases. Compared to a training function for GPU, no changes are needed to port to HPU. Internally, Ray Train does these things:\n",
    "\n",
    "* Detect HPU and set the device.\n",
    "\n",
    "* Initializes the habana PyTorch backend.\n",
    "\n",
    "* Initializes the habana distributed backend."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_loop_per_worker(configs):\n",
    "    import warnings\n",
    "\n",
    "    warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "    # Calculate the batch size for a single worker\n",
    "    worker_batch_size = configs[\"batch_size\"] // train.get_context().get_world_size()\n",
    "\n",
    "    # Download dataset once on local rank 0 worker\n",
    "    if train.get_context().get_local_rank() == 0:\n",
    "        download_datasets()\n",
    "    torch.distributed.barrier()\n",
    "\n",
    "    # Build datasets on each worker\n",
    "    torch_datasets = build_datasets()\n",
    "\n",
    "    # Prepare dataloader for each worker\n",
    "    dataloaders = dict()\n",
    "    dataloaders[\"train\"] = DataLoader(\n",
    "        torch_datasets[\"train\"], batch_size=worker_batch_size, shuffle=True\n",
    "    )\n",
    "    dataloaders[\"val\"] = DataLoader(\n",
    "        torch_datasets[\"val\"], batch_size=worker_batch_size, shuffle=False\n",
    "    )\n",
    "\n",
    "    # Distribute\n",
    "    dataloaders[\"train\"] = train.torch.prepare_data_loader(dataloaders[\"train\"])\n",
    "    dataloaders[\"val\"] = train.torch.prepare_data_loader(dataloaders[\"val\"])\n",
    "\n",
    "    # Obtain HPU device automatically\n",
    "    device = train.torch.get_device()\n",
    "\n",
    "    # Prepare DDP Model, optimizer, and loss function\n",
    "    model = initialize_model()\n",
    "    model = model.to(device)\n",
    "\n",
    "    optimizer = optim.SGD(\n",
    "        model.parameters(), lr=configs[\"lr\"], momentum=configs[\"momentum\"]\n",
    "    )\n",
    "    criterion = nn.CrossEntropyLoss()\n",
    "\n",
    "    # Start training loops\n",
    "    for epoch in range(configs[\"num_epochs\"]):\n",
    "        # Each epoch has a training and validation phase\n",
    "        for phase in [\"train\", \"val\"]:\n",
    "            if phase == \"train\":\n",
    "                model.train()  # Set model to training mode\n",
    "            else:\n",
    "                model.eval()  # Set model to evaluate mode\n",
    "\n",
    "            running_loss = 0.0\n",
    "            running_corrects = 0\n",
    "\n",
    "            for inputs, labels in dataloaders[phase]:\n",
    "                inputs = inputs.to(device)\n",
    "                labels = labels.to(device)\n",
    "\n",
    "                # zero the parameter gradients\n",
    "                optimizer.zero_grad()\n",
    "\n",
    "                # forward\n",
    "                with torch.set_grad_enabled(phase == \"train\"):\n",
    "                    # Get model outputs and calculate loss\n",
    "                    outputs = model(inputs)\n",
    "                    loss = criterion(outputs, labels)\n",
    "\n",
    "                    # backward + optimize only if in training phase\n",
    "                    if phase == \"train\":\n",
    "                        loss.backward()\n",
    "                        optimizer.step()\n",
    "\n",
    "                # calculate statistics\n",
    "                running_loss += loss.item() * inputs.size(0)\n",
    "                running_corrects += evaluate(outputs, labels)\n",
    "\n",
    "            size = len(torch_datasets[phase]) // train.get_context().get_world_size()\n",
    "            epoch_loss = running_loss / size\n",
    "            epoch_acc = running_corrects / size\n",
    "\n",
    "            if train.get_context().get_world_rank() == 0:\n",
    "                print(\n",
    "                    \"Epoch {}-{} Loss: {:.4f} Acc: {:.4f}\".format(\n",
    "                        epoch, phase, epoch_loss, epoch_acc\n",
    "                    )\n",
    "                )\n",
    "\n",
    "            # Report metrics and checkpoint every epoch\n",
    "            if phase == \"val\":\n",
    "                train.report(\n",
    "                    metrics={\"loss\": epoch_loss, \"acc\": epoch_acc},\n",
    "                )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Main Training Function\n",
    "\n",
    "The `train_resnet` function sets up the distributed training environment using Ray and starts the training process. It specifies the batch size, number of epochs, learning rate, and momentum for the SGD optimizer. To enable training using HPU, we only need to make the following changes:\n",
    "* Require an HPU for each worker in ScalingConfig\n",
    "* Set backend to \"hccl\" in TorchConfig"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_resnet(num_workers=2):\n",
    "    global_batch_size = 16\n",
    "\n",
    "    train_loop_config = {\n",
    "        \"input_size\": 224,  # Input image size (224 x 224)\n",
    "        \"batch_size\": 32,  # Batch size for training\n",
    "        \"num_epochs\": 10,  # Number of epochs to train for\n",
    "        \"lr\": 0.001,  # Learning Rate\n",
    "        \"momentum\": 0.9,  # SGD optimizer momentum\n",
    "    }\n",
    "    # Configure computation resources\n",
    "    # In ScalingConfig, require an HPU for each worker\n",
    "    scaling_config = ScalingConfig(num_workers=num_workers, resources_per_worker={\"CPU\": 1, \"HPU\": 1})\n",
    "    # Set backend to hccl in TorchConfig\n",
    "    torch_config = TorchConfig(backend = \"hccl\")\n",
    "    \n",
    "    ray.init()\n",
    "    \n",
    "    # Initialize a Ray TorchTrainer\n",
    "    trainer = TorchTrainer(\n",
    "        train_loop_per_worker=train_loop_per_worker,\n",
    "        train_loop_config=train_loop_config,\n",
    "        torch_config=torch_config,\n",
    "        scaling_config=scaling_config,\n",
    "    )\n",
    "\n",
    "    result = trainer.fit()\n",
    "    print(f\"Training result: {result}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start Training\n",
    "\n",
    "Finally, we call the `train_resnet` function to start the training process. You can adjust the number of workers to use. Before running this cell, ensure that Ray is properly set up in your environment to handle distributed training.\n",
    "\n",
    "Note: the following warning is fine, and is resolved in SynapseAI version 1.14.0+:\n",
    "```text\n",
    "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:252: UserWarning: Device capability of hccl unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "train_resnet(num_workers=2) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Possible outputs\n",
    "\n",
    "``` text\n",
    "2025-03-03 03:32:12,620 INFO worker.py:1841 -- Started a local Ray instance.\n",
    "/usr/local/lib/python3.10/dist-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The `RunConfig` class should be imported from `ray.tune` when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: https://github.com/ray-project/ray/issues/49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0\n",
    "  _log_deprecation_warning(\n",
    "(RayTrainWorker pid=63669) Setting up process group for: env:// [rank=0, world_size=2]\n",
    "(TorchTrainer pid=63280) Started distributed worker processes: \n",
    "(TorchTrainer pid=63280) - (node_id=9f2c34ea47fe405f3227e9168aa857f81655a83e95fd6be359fd76db, ip=100.83.111.228, pid=63669) world_rank=0, local_rank=0, node_rank=0\n",
    "(TorchTrainer pid=63280) - (node_id=9f2c34ea47fe405f3227e9168aa857f81655a83e95fd6be359fd76db, ip=100.83.111.228, pid=63668) world_rank=1, local_rank=1, node_rank=0\n",
    "(RayTrainWorker pid=63669) ============================= HABANA PT BRIDGE CONFIGURATION =========================== \n",
    "(RayTrainWorker pid=63669)  PT_HPU_LAZY_MODE = 1\n",
    "(RayTrainWorker pid=63669)  PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024\n",
    "(RayTrainWorker pid=63669)  PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807\n",
    "(RayTrainWorker pid=63669)  PT_HPU_LAZY_ACC_PAR_MODE = 1\n",
    "(RayTrainWorker pid=63669)  PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0\n",
    "(RayTrainWorker pid=63669)  PT_HPU_EAGER_PIPELINE_ENABLE = 1\n",
    "(RayTrainWorker pid=63669)  PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1\n",
    "(RayTrainWorker pid=63669)  PT_HPU_ENABLE_LAZY_COLLECTIVES = 0\n",
    "(RayTrainWorker pid=63669) ---------------------------: System Configuration :---------------------------\n",
    "(RayTrainWorker pid=63669) Num CPU Cores : 160\n",
    "(RayTrainWorker pid=63669) CPU RAM       : 1056374420 KB\n",
    "(RayTrainWorker pid=63669) ------------------------------------------------------------------------------\n",
    "(RayTrainWorker pid=63668) Downloading: \"https://download.pytorch.org/models/resnet50-0676ba61.pth\" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth\n",
    "  0%|          | 0.00/97.8M [00:00<?, ?B/s]\n",
    "  9%|▊         | 8.38M/97.8M [00:00<00:01, 87.7MB/s]\n",
    "100%|██████████| 97.8M/97.8M [00:00<00:00, 193MB/s]\n",
    "100%|██████████| 97.8M/97.8M [00:00<00:00, 203MB/s]\n",
    "\n",
    "View detailed results here: /root/ray_results/TorchTrainer_2025-03-03_03-32-15\n",
    "To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2025-03-03_03-32-10_695011_53838/artifacts/2025-03-03_03-32-15/TorchTrainer_2025-03-03_03-32-15/driver_artifacts`\n",
    "\n",
    "Training started with configuration:\n",
    "╭──────────────────────────────────────╮\n",
    "│ Training config                      │\n",
    "├──────────────────────────────────────┤\n",
    "│ train_loop_config/batch_size      32 │\n",
    "│ train_loop_config/input_size     224 │\n",
    "│ train_loop_config/lr           0.001 │\n",
    "│ train_loop_config/momentum       0.9 │\n",
    "│ train_loop_config/num_epochs      10 │\n",
    "╰──────────────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 0-train Loss: 0.6574 Acc: 0.6066\n",
    "\n",
    "Training finished iteration 1 at 2025-03-03 03:32:45. Total running time: 29s\n",
    "╭───────────────────────────────╮\n",
    "│ Training result               │\n",
    "├───────────────────────────────┤\n",
    "│ checkpoint_dir_name           │\n",
    "│ time_this_iter_s       24.684 │\n",
    "│ time_total_s           24.684 │\n",
    "│ training_iteration          1 │\n",
    "│ acc                   0.71053 │\n",
    "│ loss                  0.51455 │\n",
    "╰───────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 0-val Loss: 0.5146 Acc: 0.7105\n",
    "(RayTrainWorker pid=63669) Epoch 1-train Loss: 0.5016 Acc: 0.7541\n",
    "\n",
    "Training finished iteration 2 at 2025-03-03 03:32:46. Total running time: 31s\n",
    "╭───────────────────────────────╮\n",
    "│ Training result               │\n",
    "├───────────────────────────────┤\n",
    "│ checkpoint_dir_name           │\n",
    "│ time_this_iter_s      1.39649 │\n",
    "│ time_total_s          26.0805 │\n",
    "│ training_iteration          2 │\n",
    "│ acc                   0.93421 │\n",
    "│ loss                  0.30218 │\n",
    "╰───────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 1-val Loss: 0.3022 Acc: 0.9342\n",
    "(RayTrainWorker pid=63669) Epoch 2-train Loss: 0.3130 Acc: 0.9180\n",
    "\n",
    "Training finished iteration 3 at 2025-03-03 03:32:47. Total running time: 32s\n",
    "╭───────────────────────────────╮\n",
    "│ Training result               │\n",
    "├───────────────────────────────┤\n",
    "│ checkpoint_dir_name           │\n",
    "│ time_this_iter_s      1.37042 │\n",
    "│ time_total_s          27.4509 │\n",
    "│ training_iteration          3 │\n",
    "│ acc                   0.93421 │\n",
    "│ loss                  0.22201 │\n",
    "╰───────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 2-val Loss: 0.2220 Acc: 0.9342\n",
    "(RayTrainWorker pid=63669) Epoch 3-train Loss: 0.2416 Acc: 0.9262\n",
    "\n",
    "Training finished iteration 4 at 2025-03-03 03:32:49. Total running time: 34s\n",
    "╭───────────────────────────────╮\n",
    "│ Training result               │\n",
    "├───────────────────────────────┤\n",
    "│ checkpoint_dir_name           │\n",
    "│ time_this_iter_s      1.38353 │\n",
    "│ time_total_s          28.8345 │\n",
    "│ training_iteration          4 │\n",
    "│ acc                   0.96053 │\n",
    "│ loss                  0.17815 │\n",
    "╰───────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 3-val Loss: 0.1782 Acc: 0.9605\n",
    "(RayTrainWorker pid=63669) Epoch 4-train Loss: 0.1900 Acc: 0.9508\n",
    "\n",
    "Training finished iteration 5 at 2025-03-03 03:32:50. Total running time: 35s\n",
    "╭───────────────────────────────╮\n",
    "│ Training result               │\n",
    "├───────────────────────────────┤\n",
    "│ checkpoint_dir_name           │\n",
    "│ time_this_iter_s      1.37318 │\n",
    "│ time_total_s          30.2077 │\n",
    "│ training_iteration          5 │\n",
    "│ acc                   0.93421 │\n",
    "│ loss                  0.17063 │\n",
    "╰───────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 4-val Loss: 0.1706 Acc: 0.9342\n",
    "(RayTrainWorker pid=63669) Epoch 5-train Loss: 0.1346 Acc: 0.9672\n",
    "\n",
    "Training finished iteration 6 at 2025-03-03 03:32:52. Total running time: 36s\n",
    "╭───────────────────────────────╮\n",
    "│ Training result               │\n",
    "├───────────────────────────────┤\n",
    "│ checkpoint_dir_name           │\n",
    "│ time_this_iter_s      1.37999 │\n",
    "│ time_total_s          31.5876 │\n",
    "│ training_iteration          6 │\n",
    "│ acc                   0.96053 │\n",
    "│ loss                   0.1552 │\n",
    "╰───────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 5-val Loss: 0.1552 Acc: 0.9605\n",
    "(RayTrainWorker pid=63669) Epoch 6-train Loss: 0.1184 Acc: 0.9672\n",
    "\n",
    "Training finished iteration 7 at 2025-03-03 03:32:53. Total running time: 38s\n",
    "╭───────────────────────────────╮\n",
    "│ Training result               │\n",
    "├───────────────────────────────┤\n",
    "│ checkpoint_dir_name           │\n",
    "│ time_this_iter_s      1.39198 │\n",
    "│ time_total_s          32.9796 │\n",
    "│ training_iteration          7 │\n",
    "│ acc                   0.94737 │\n",
    "│ loss                  0.14702 │\n",
    "╰───────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 6-val Loss: 0.1470 Acc: 0.9474\n",
    "(RayTrainWorker pid=63669) Epoch 7-train Loss: 0.0864 Acc: 0.9836\n",
    "\n",
    "Training finished iteration 8 at 2025-03-03 03:32:54. Total running time: 39s\n",
    "╭───────────────────────────────╮\n",
    "│ Training result               │\n",
    "├───────────────────────────────┤\n",
    "│ checkpoint_dir_name           │\n",
    "│ time_this_iter_s       1.3736 │\n",
    "│ time_total_s          34.3532 │\n",
    "│ training_iteration          8 │\n",
    "│ acc                   0.94737 │\n",
    "│ loss                  0.14443 │\n",
    "╰───────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 7-val Loss: 0.1444 Acc: 0.9474\n",
    "(RayTrainWorker pid=63669) Epoch 8-train Loss: 0.1085 Acc: 0.9590\n",
    "\n",
    "Training finished iteration 9 at 2025-03-03 03:32:56. Total running time: 40s\n",
    "╭───────────────────────────────╮\n",
    "│ Training result               │\n",
    "├───────────────────────────────┤\n",
    "│ checkpoint_dir_name           │\n",
    "│ time_this_iter_s      1.37868 │\n",
    "│ time_total_s          35.7319 │\n",
    "│ training_iteration          9 │\n",
    "│ acc                   0.94737 │\n",
    "│ loss                  0.14194 │\n",
    "╰───────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 8-val Loss: 0.1419 Acc: 0.9474\n",
    "(RayTrainWorker pid=63669) Epoch 9-train Loss: 0.0829 Acc: 0.9754\n",
    "\n",
    "2025-03-03 03:32:58,628 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/root/ray_results/TorchTrainer_2025-03-03_03-32-15' in 0.0028s.\n",
    "Training finished iteration 10 at 2025-03-03 03:32:57. Total running time: 42s\n",
    "╭───────────────────────────────╮\n",
    "│ Training result               │\n",
    "├───────────────────────────────┤\n",
    "│ checkpoint_dir_name           │\n",
    "│ time_this_iter_s      1.36497 │\n",
    "│ time_total_s          37.0969 │\n",
    "│ training_iteration         10 │\n",
    "│ acc                   0.96053 │\n",
    "│ loss                  0.14297 │\n",
    "╰───────────────────────────────╯\n",
    "(RayTrainWorker pid=63669) Epoch 9-val Loss: 0.1430 Acc: 0.9605\n",
    "\n",
    "Training completed after 10 iterations at 2025-03-03 03:32:58. Total running time: 43s\n",
    "\n",
    "Training result: Result(\n",
    "  metrics={'loss': 0.1429688463869848, 'acc': 0.9605263157894737},\n",
    "  path='/root/ray_results/TorchTrainer_2025-03-03_03-32-15/TorchTrainer_19fd8_00000_0_2025-03-03_03-32-15',\n",
    "  filesystem='local',\n",
    "  checkpoint=None\n",
    ")\n",
    "(RayTrainWorker pid=63669) Downloading: \"https://download.pytorch.org/models/resnet50-0676ba61.pth\" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth\n",
    "  0%|          | 0.00/97.8M [00:00<?, ?B/s]\n",
    " 68%|██████▊   | 66.1M/97.8M [00:00<00:00, 160MB/s] [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10.12 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orphan": true,
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
